img = image_read("../Workflow.png")
image_trim(img)Metabolon Work Flow
Introduction
The purpose of this pipeline is to provide a set of additional analyses to complement the analysis done by Metabolon. These additional analysis are broken down into six main sections.
Peak normalization and standardization (default off)
Analysis data creation
Data exploration
Subpathway Analysis
Pairwise comparisons
Box plot and line plots
The overall structure for each of these modules is shown in the figure below.
The R “chunks” separate each of the analysis steps within the main sections of the pipeline and each chunk is tied to a .R file in the Code folder. In some chunks, there may need to be lines of code that need to change from experiment to experiment. This pipeline provides a list of steps for each chunk and highlights places where you need to make changes.
Link and folder set up
To get started, in the Metabolomics pipeline working directory, there are five folders with the following folder structure:
- Data
- Metabolon: Contains the .xlsx file from Metabolon and other files related to the metadata.
- Processed: Contains data processed from this pipeline.
- Code: Contains the .R files needed for this pipeline
- Outputs
- Plots
- Tables
- Reports
- R: Contains and .R file with additional functions used for this pipeline.
- renv: This folder is generated by renv to restore package versions.
You should save the .xlsx file from Metabolon and any additional metadata files to the Data/Metabolon folder. The main output folder will contain the plots and tables generated from this pipeline. Additionally, you can run this pipeline as a report and save it as a .pdf or .html. The report saves to the “Outputs/Reports” folder.
Normalization and Standardization (default=off)
Metabolome include peak data that has been normalized and standardized as part of the .xlsx Metabolome data file. We recommend you use the normalized peak data from Metabolome and therefore, the normalization and standardization steps are “off” by default.. However, is some cases you may need to perform normalization differently than what was provided by Metabolome. An example of this is if your expirement was run in two different batches. Metabolome will perform the normalization for each batch seperatly. This can cause downstream issues and we recomomned combining the raw data and performing one normalization on both batches. To turn on the normalization steps set eval=TRUE.
In its raw form, the peak data contain counts for each metabolite in each sample. Standardization and normalization are important steps to improve the signal-to-noise ratio. In the next section, we are implementing the same standardization and normalization methods used by Metabolon. First we will load the raw peak data into R. To start we:
# 1 Provide a path to Metabolon .xlsx file.
metabolon_path <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"- 1
- Provide a path with the file name to the Data tables .xlsx file provided by Metabolon. You may need to change this file path to reflect the name of the data tables .xlsx file for your project.
# 2. Load raw peak data
peak_data <- read.xlsx(metabolon_path, sheet = "Peak Area Data")- 2
- Read in the “Peak Area Data” tab from the .xlsx file provided above.
Median Standardization
The first step in the Metabolon normalization process is median standardization. This step will ensure that each metabolite in the dataset has the same median value. To show this we will show the box plots for 5 metabolites before and after median normalization. Below is the box plots before median standardization where we:
# 1. Select the first five metabolites for the box plot.
metabolites <- names(peak_data)[2:6]
# 2. Create the boxplot data
plot_data <- peak_data %>%
reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
filter(variable %in% metabolites)
# 3. Plot the box plots for each of the five metabolites.
ggplot(plot_data,aes(x=variable,y=value)) +
geom_boxplot() +
theme_bw()- 1
- Select the first five metabolites for the box plot.
- 2
- Create the boxplot data
- 3
- Plot the box plots for each of the five metabolites.
Now we perform median standardization.
# 1. Initialize a new peak_data_med matrix
peak_data_norm <- peak_data
# 2. Create a matrix containing the median value for each metabolite
peak_data_med <- peak_data_norm %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(median, na.rm = T)
# 3. Divide each value for each metabolite by the median value of that metabolite
for(i in colnames(peak_data_med)){
peak_data_norm[,i] <- peak_data_norm[,i]/peak_data_med[,i]
}- 1
- Initializing a new peak_data_norm matrix
- 2
- Create a data set containing the median value for each metabolite
- 3
- Divide each value by the median metabolite value.
After the median standardization we can look at those same metabolites from before. We will notice that the median value for all metabolites are now lined up at 1.
# 1. Select the first five metabolites for the box plot.
metabolites <- names(peak_data)[2:6]
# 2. Create the boxplot data
plot_data <- peak_data_norm %>%
reshape2::melt(id.vars="PARENT_SAMPLE_NAME") %>%
filter(variable %in% metabolites)
# 3. Plot the box plots for each of the five metabolites.
ggplot(plot_data,aes(x=variable,y=value)) +
geom_boxplot() +
theme_bw()- 1
- Select the first five metabolites for the box plot.
- 2
- Create the boxplot data
- 3
- Plot the box plots for each of the five metabolites.
Minimum value imputation
Once we standardize each metabolite by the median value, we impute the missing values for each metabolite with the minimum value for that metabolite.
# 1. Initialize the new peak_data_imputed matrix
peak_data_imputed <- peak_data_norm
# 2. Find the minimum value for each metabolite and compute 1/5 of that value
peak_data_mins <- peak_data_imputed %>%
select(-PARENT_SAMPLE_NAME) %>%
summarise_all(min, na.rm = T)
# 3. Impute the value
for(i in colnames(peak_data_mins)){
if(length(peak_data_imputed[,i][is.na(peak_data_imputed[,i])]) > 0){
peak_data_imputed[which(is.na(peak_data_imputed[,i])),i] <- as.numeric(peak_data_mins[i])
}
}- 1
- Initialize the new peak_data_imputed matrix
- 2
- Find the minimum value for each metabolite.
- 3
- Create a for loop such that if a metabolite has a missing observation, then the minimum value for that metabolite is imputed.
Log Transformation
The final step is to take a natural log transformation of each of the values.
# 1. Log transform all of the values
peak_data_log <- peak_data_imputed %>%
mutate_if(is.numeric, log)- 1
- Log transform all of the values
Save data
Now that we have imputed and normalized our data, we need to save the data for subsequent sections. We will save peak_data_log in the “Data/Processed” Folder under the file name “Log_transformed_data.csv”.
# 1. Save log transformed data.
write.csv(peak_data_log, file = "../Data/Processed/Log_transformed_data.csv",
row.names = F)- 1
- Save log transformed data.
Analysis Data Creation
For the downstream analysis we will need merge the sample metadata and the log transformed peak data to create one analysis dataset. In the normalization and standardization section we took the raw peak data and performed median value standardization, minimum value imputation and natural log transformation. Metabolon does these preprocessing steps for their customers and provides the log transformed data for their customers in a .xlsx. We recommend utilizing the preprocessed log transformed preak data provided by Metabolon.
We need to merge the sample metadata and the Log normalized data together. To do this we are going to read in the sample metadata and the log normalized data by:
# 1. Metabolon excel file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"- 1
- Provide a path to the metabolon .xlsx file.
# 2. Read in sample metadata
meta_data <- read.xlsx(met_excel,sheet = "Sample Meta Data")
# 3. Read Log transformed data
peak_data_log <- read.xlsx(met_excel,sheet = "Log Transformed Data")- 2
- Read in the sample metadata under the “Sample Meta Data” tab.
- 3
- Read in the Log transformed data from the “Log Transformed Data” tab.
If there are additional variables that need to be included into the analysis, you can place these variables in a .xlsx file. You must include a variable called “PARENT_SAMPLE_NAME” that links the metadata to each of the samples.
Please provide a path to the additional metadata variables. If you do not have any additional meta variables set additional_meta=FALSE.
# If you have additional meta data set additional_meta=T
additional_meta=T
# 1. Provide path to additional metadata
additional_metaPath <- "../Data/Metabolon/AdditionalVars.xlsx"- 1
- Set this variable to TRUE if you have additional metadata variables
- 2
- Provide a path to the additional metadata variables. If you do not have any additional metadata variables, you can set the “additional_meta=NULL”.
If additional_meta=TRUE we will merge the metadata with the Metabolome metadata.
# 2. Merge additional vars to the meta data
if(!is.null(additional_meta)){
# Read in additional meta data
meta_data_additional <- read.xlsx(additional_metaPath)
meta_data <- meta_data_additional %>%
left_join(meta_data,"PARENT_SAMPLE_NAME")
}- 3
- If there are additional meta data variables, then merge the additional metadata variables with the metadata from Metabolon.
To create the analysis data, we want the metadata and the log transformed peak data merged so each row of the analysis data set is a different sample and the columns contain the sample metadata and the log transformed peak data. To do this we need to select the metadata variables need for down stream analysis. The metadata provided by Metabolon includes variables which may not be necessary for the analysis. In this section will select the variables in the metadata that are necessary for downstream analysis. We will need to identify the variables we want to keep. This will include the PARENT_SAMPLE_NAME and the experiment group variables. Additionally, if your experiment includes males and females, we recommend including a Sex/Gender variable and stratify the analysis.
# 1. Select metadata variables
metadata_variables <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")- 1
- Provided list of metadata variables we are interested in keeping for the downstream analysis.
Once we have the list of desired metadata variables, we can merge the log normalized peak data with the relevent metadata for the downstream analysis.
# 2. Create analysis data
analysis_data <- meta_data %>%
select(all_of(metadata_variables)) %>%
left_join(peak_data_log,"PARENT_SAMPLE_NAME")- 2
- Merge the metadata and the log transformed data together while keeping only the metadata variables needed for the analysis.
Table 1
The goal of this section is to design a table that resembles the study design of your experiment. This step allows us to see descriptive statistics of the samples in our data. For this, we are using the analysis data to define a table with a similar structure as the table below.
| Strat 1 | Strat 2 | |
|---|---|---|
| Var1 | XX | XX |
| Var 2 | XX | XX |
| Var 3 | XX | XX |
Alternatively, we can define a non-stratified table with only the number of observations within each experimental group. The following code will create the desired table 1 for both scenarios.
Here, we will need your input to identify the variables you would like to include in the table 1. Additionally, we can add a strata variable to stratify the table by. Finally, we can add a label to these variables to get them a more meaningful title in the table.
# 1. Choose the meta data variable name.
var1 = "GROUP_NAME"
var2 = "TIME1"
# 2. Choose a single variable to stratify the table by.
stratified_var = "Gender"
# 3. Assign the variable a label name
label(analysis_data[,var1]) <- "Treatment Group"
label(analysis_data[,var2]) <- "Time"
# 4. Assign a stratified variable a label name for the table.
label(analysis_data[,stratified_var]) = "Gender"- 1
- Choose the variable name. Change the name assigned to var1 to match the name of the variable in the analysis data set which you want to include in the table. Repeat this process for as many variables as you desire.
- 2
- Choose a single variable to stratify the table by.
- 3
- Assign the variable a label name, you can change the label of the variable in the table by relabeling the variable to the desired name on the right side of the equation.
- 4
- Assign a stratified variable a label name for the table.
Now that you have properly assigned the variables, we will create table 1. To create the table, we will be using the table1 function. You must change the arguments depending on what you want the table to display. For example: * Table 1 without interaction:
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}}, \ \text{data = analysis_data} ) \]
- Table 1 with interaction
\[ table1(\sim \underbrace{var_1 + var_2 ...}_{\text{Variable names for rows}} \underbrace{| \ stratified\_var}_{\text{Statified Variable}}, \ \text{data = analysis_data} ) \]
Now you will need to edit the table 1 function to create the desired table.
# 1. Creates table1
tbl1 <- table1(~ TIME1 + GROUP_NAME| Gender
, data = analysis_data)- 1
- Create table 1. In this step you will need to use the same variable names as in the above chunk within the table1 function.
# 2. Displays table 1
tbl1- 2
- Display the table
| Female (N=44) |
Male (N=42) |
Overall (N=86) |
|
|---|---|---|---|
| Time | |||
| End | 14 (31.8%) | 15 (35.7%) | 29 (33.7%) |
| Onset | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| PreSymp | 15 (34.1%) | 13 (31.0%) | 28 (32.6%) |
| Treatment Group | |||
| 1902 | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
| Control | 14 (31.8%) | 14 (33.3%) | 28 (32.6%) |
| WT | 15 (34.1%) | 14 (33.3%) | 29 (33.7%) |
# 3. saves table 1
t1flex(tbl1) %>%
save_as_docx(path = paste0("../Outputs/Tables/","table1.docx"))- 3
- Save the table in the folder “/Outputs/Tables/” with the file name “table1”.
Once we have the analysis data, and we are happy with the table 1 we have created we want to save it so we can use it for the downstream analysis. We will save this in the “Data/Processed” folder under the file name “analysis_data.csv”
# Save analysis data
write.csv(analysis_data,"../Data/Processed/analysis_data.csv", row.names = F)- 1
- Save the analysis data to your desired location.
Exploratory Analysis
In data exploration, we use several methods to help us better understand the underlying patterns in the data without using a formal hypothesis test. In this pipeline, we are going to focus on two methods for data exploration:
A.) Principal component analysis
B.) Heatmaps
Before diving into the exploratory analysis, you will need to provide a path to where the analysis data is stored.
# 1. Read analysis dataset
analysis_data_path <- "../Data/Processed/analysis_data.csv"- 1
- Provide a path to where your analysis data is saved.
Additionally, please provide a path to where you would like to save the plots generated from this section.
out <- "../Outputs/Plots"- 1
- Provide a path to where you would like to store the outputed plots.
Principal Component Analysis
In general, Principal component analysis (PCA) reduces the number of variables in a dataset while preserving as much information from the data as possible. At a high level, PCA is constructed such that the first principal component accounts for the largest possible amount of variance within the data. The second principal component accounts for the largest amount of remaining variance, and so on. Additionally, each of the principal components produced by PCA is uncorrelated with each of the other principal components. PCA can allow us to visualize sources of variation in the data.
Here, we will graph the first two principal components of our dataset. In the PCA plot, we can label each point based on variables from the metadata.
Before running PCA you need to identify which variables in the data are NOT metabolites. Additionally, you need to include which variable you would like to label the PCA plot with.
# 1. Create the non-metabolite vector.
non_metabolite <- c("PARENT_SAMPLE_NAME",
"GROUP_NAME",
"TIME1",
"Gender")
# 2. Define labels
meta_var = "Gender"- 1
- Provide a list of the non-metabolite variables in the analysis dataset.
- 2
- Identify the variable you would like to use to label the PCA plot with.
Using the input you have given us, we will now construct the PCA plot.
# 3. Read in analysis data
analysis_data <- read.csv(analysis_data_path, check.names = F)
# 4. Create PCA data containing only metabolite data
pca_dat <- analysis_data[,!names(analysis_data) %in% non_metabolite]
# 5. Run PCA of the pca_dat matrix containing only the metabolites.
res.pca <- PCA(pca_dat,
graph = F)
# 6. Create figure
fviz_pca_ind(res.pca,
label = "none",
habillage = as.factor(analysis_data[,meta_var]))- 7
- Save figure
# 7. Save figure
ggsave(paste0(out,"/PCA.pdf"), width = 10, height = 10)Suppose you notice a variable with clearly separated groups, and is not a variable of interest. In that case, you may want to consider stratifying your analysis downstream by the values of that variable. For example, if we are looking at the effects of a specific drug, and we notice in the above plot that the samples are grouped by sex, then we may want to consider stratifying the analysis by male and female.
Heatmaps
Another exploratory analysis tool we can use is heatmaps, which is a gridded plot based on the x-axis- and y-axis labels. The color of the spot on the grid is based on the value’s intensity. For our heatmap, the x-axis will be the samples, and the y-axis will be the metabolites. The values determining the colors will be the log peak value for each chemical in each observation. We can order our observations to see intensity patterns based on our experimental conditions. This is another way of visualizing sources of variation within our data. To do this, we need to merge the chemical annotation, meta, and scaled peak data. Then we need to arrange the data based on our experimental conditions. The create_heatmap_Data function in the R folder will help us prepare the data for the heatmap. To do this the create_heatmap_Data function takes three arguments.
Analysis data (analysis_data)
Heatmap Variables (heatmap_variables) - defines the variables used to order the samples.
… -
The main utility of create_heatmap data is in (…), which allows you to arrange the columns of the heatmap data. If you are familiar with dplyr, the arrange function orders samples based on the arguments passed into (…). If you are unfamiliar with dplyr, an overview of the arrange function is here.
We will be building three heatmaps, each heatmap will be increasing in complexity. For your experiment you may not need to use all three of these heatmaps. These three heatmaps will be demonstrated with the mouse ALS data. For each heatmap we will look at the top 50 metabolites, which is determined by the highest mean value of the metabolites. By filtering to the top 50 metabolites we will improve our ability to see sources of variation.
Heatmap with top 50 metabolites: Treatment Groups
The first heatmap we are creating is going to be the most simple by arranging the sample by only the treatment group of interest. First you will need to provide the variable name that you would like to arrange the heatmap by.
# 1. Define grouping variables
heatmap_variables <- c("GROUP_NAME")- 1
- Define grouping variables
In the heatmap below, we select the top 50 metabolites. Once we know the top 50 metabolites with the highest mean value, we can filter the peak_data_log data only to include those metabolites. Then we can run the create_heatmap function to create the heatmap data. In the create_heatmap_data function, we arrange the data by treatment group.
# 2. Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# 3. Filter to the top 50 metabolites
analysis_data_top50 <- analysis_data %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name)- 2
- Find the top 50 metabolites
- 3
- Filter to the top 50 metabolites
Now we will need your input to create the heatmap data. The last option allow you to arrange the variable have you like using dplyr’s arrange function.
# Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50,
heatmap_variables = heatmap_variables ,
GROUP_NAME)- 1
- This function creates the heatmap data based on the input you have given.
- 2
- Edit this line to change how the samples are arranged.
Once we have the heatmap data we can now generate and save the first heatmap.
# Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# Create heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F)- 4
- Save heatmap
# Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, show_colnames = F,border_color = NA,filename = paste0(out,"/HeatmapTop50Metabolites.pdf"))Heatmap with top 50 metabolites: Treatment and Block Groups
We can add more complexity to the heatmap by including the blocking variable in the annotation. To do this you will need to add the blocking variable to the “heatmap_variables” list.
# 1. Define meta analysis variables
heatmap_variables <- c("GROUP_NAME",
"TIME1")- 1
- Define meta analysis variables
Now we have the save steps from above to find the top 50 metabolites.
# 1. Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# 2. Filter to the top 50 metabolites
analysis_data_top50 <- analysis_data %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name) - 2
- Filter to the top 50 metabolites
Now we need to include the blocking variable in the arrange options to create the heatmap data.
# 4. Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50,
heatmap_variables = heatmap_variables ,
GROUP_NAME, desc(TIME1)) #- 1
- This function creates the heatmap data based on the input you have given.
- 2
- Edit this line to change how the samples are arranged.
# Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# Create heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F)- 4
- Save heatmap
# Save heatmap
pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, show_colnames = F,border_color = NA,filename = paste0(out,"/HeatmapTop50MetabolitesTreatAndBlock.pdf"))Heatmap with top 50 metabolites: Treatment and Block Groups Stratified
We may want to stratify this heatmap even farther by adding an additional variable. When we add a third variable, the heatmap may become more challenging to interpret. To overcome this, we recommend stratifying the heatmap by the levels of the third variable. For example, if we want to stratify by gender, we will have two separate heatmaps, one for males and one for females.
Similarly to the above heatmap, we will need to include “heatmap_variables”. However, we also need to provide the variable we want to stratify by. The heatmaps will be created by stratifying by each level of the “strat_var” variable and then creating the heatmap for that variable.
# 1. Get stratified variable and stratified values
strat_var <- "Gender"
values <- unique(analysis_data[,strat_var])
# 2 Get heatmap Variables
heatmap_variables <- c("GROUP_NAME",
"TIME1")- 1
- Provide the variable name you would like to stratify by.
- 2
- This code finds all of the unique values of the stratified variables.
- 3
- Provide the variable names you would like to include in the heatmap.
Now you may need to edit the create_heatmap function within the for loop because this will need to be ran for each level of the stratified variable.
# Find the top 50 metabolites
select_variables <- analysis_data %>%
select(-all_of(c("PARENT_SAMPLE_NAME",heatmap_variables))) %>%
summarise(across(everything(),\(x) mean(x,na.rm = T))) %>%
pivot_longer(cols = everything()) %>%
arrange(desc(value)) %>%
slice(c(1:50))
# Create for loop
maps <- list()
for (i in 1:length(values)) {
# Filter to the top 50 metabolites
analysis_data_top50_i <- analysis_data[analysis_data[,strat_var]==values[i],] %>%
select(all_of(c("PARENT_SAMPLE_NAME", heatmap_variables)), select_variables$name)
# Create heatmap data
heatmap_data <- create_heatmap_Data(analysis_data_top50_i,
heatmap_variables = heatmap_variables ,
GROUP_NAME, desc(TIME1))
# Heat map colors
palette <- colorRampPalette(rev(brewer.pal(10, "RdBu")))(256)
# Values for heatmap
vals = heatmap_data$heatmap_data_vals
meta = heatmap_data$heatmap_variables
# Create heatmap
maps[[values[i]]] = pheatmap(vals, cluster_cols = F, cluster_rows = F, color = palette,
annotation_col = meta, show_rownames = F, border_color = NA, show_colnames = F,
main = values[i], silent = T)
}
# 8. Save heatmaps for all levels
grids <- grid.arrange(grobs= list(maps$Female[[4]],
maps$Male[[4]]))- 8
- Save heatmaps for all levels
ggsave(filename = paste0(out,"/HeatmapTop50MetabolitesTreamentandBlockStratified.pdf"), grids)Saving 12 x 12 in image
Sub-pathway Analysis
Before diving in, lets load the matrices that we will need for the Sub-pathway Analysis section. The two matrices we will need are:
Analysis data
Chemical annotation data (chem_data): Contains all of the information about the metabolites, including which sub-pathway and super-pathway they belong to.
Please provide the paths to the analysis data and the .xlsx file from Metabolon which contains the chemical annotation file.
# 1. Path to Metabolon .xlsx file.
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"
analysis_dat_path <- "../Data/Processed/analysis_data.csv"- 1
- Path to Metabolon .xlsx file
- 2
- Path to the analysis data file.
Now we will read in the data for these two files.
# 1. Analysis Data
analysis_data <- read.csv(analysis_dat_path, check.names = F)
# 2. Read chemical annotations
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")- 1
- Analysis Data
- 2
- Read chemical annotations
In the chemical annotation file, we will see that each metabolite is within a sub-pathway, and each sub-pathway is within a super-pathway. There are several metabolites within each sub-pathways and several sub-pathways within each Super-pathway. We can utilize an Analysis of variance (ANOVA) model to test for a difference in peak intensities between the treatment groups at the metabolite level, which is already part of the Metabolon analysis. However, since multiple metabolites are within a sub-pathway, it is challenging to test if the treatment affected the peak data at the sub-pathway level. For this, we will be utilizing a combined fisher probability test. The combined Fisher test combines the p-values from independent tests to test the hypothesis for an overall effect. For our use the Combined Fisher Probability is useful to test a model at the subpathway level based on the pvalues from the the model at the metabolite level.
Combined Fished Analysis
we will test at the sub-pathway level by combining the p-values for each metabolite within the sub-pathway for each model. We use a combination function given by \(\tilde{X}\) which combines the pvalues resulting in a chi-squared test statistic.
\[ \tilde{X} = -2\sum_{i=1}^k ln(p_i) \] where \(k\) is the number of metabablites in the sub-pathway. We can get a p-value from \(P(X \geq\tilde{X})\), knowing that \(\tilde{X}\sim \chi^2_{2k}\). You will notice that smaller pvalues will lead to a larger \(\tilde{X}\).
Assumptions
Since we are first testing each metabolite utilizing ANOVA, we make the following assumptions for each metabolite,
Independence: Each observation is independent of all other observations. Therefore, if you have collected multiple samples from the same subject then this assumption may be violated.
Normality: The metabolite log-scaled intensities follow a normal distribution within each of the treatment groups.
Homoscedasticity: Equal variance between treatment groups.
In addition to the assumptions in the ANOVA models at the metabolite level, the Fisher’s Combined probability places an independence assumptions between the metabolites being tested within the sub-pathway.
For more about the Combined Fisher Probability and other methods that can address this problem see:
Loughin, Thomas M. “A systematic comparison of methods for combining p-values from independent tests.” Computational statistics & data analysis 47.3 (2004): 467-485.
Models
To test our hypothesis at the sub-pathway level, we first have to form our hypothesis at the metabolite level. For each metabolite, we test three models.
1.) Interaction: \(log Peak = Treatment + block + Treatment*block\)
2.) Parallel: \(log Peak = Treatment + block\)
3.) Single: \(log Peak = Treatment\)
For the interaction model, we are focusing only on the interaction term “Treatment*block” for this to test if there is a significant interaction between our treatment and the block variable. The parallel model is testing if the block variable is explaining a significant amount of the metabolite variance, and the treatment model is testing if the treatment explains a significant proportion of the variance for each metabolite.
Each model will be tested at the subpathway level with the Combined Fisher Probability.
Combined Fisher Analysis
Now that we have our analysis data, we test the metabolomic data at the sub-pathway level utilizing the pathway_analysis function defined in the R script. This function takes the following arguments:
Ananlysis data (analysis_data): This is a data frame created in the “Analysis data creation” module which combines the metadata and the peak data into one analysis data set.
Chemical Annotation data (chem_data) provided by Metabolon.
Treatment variable (treat_var): This is the name of the variable in the analysis data that is the main variable of interest.
Block Variable (block_var): Is the name of the blocking variable in the dataset. If the the experimental design does not include a blocking variable, then the value of block_var=NULL.
Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite. The warning message from the subpathway_analysis function will display the metabolite that produced the warning message.
# 1. Run the subpathway function
path_dat <- subpathway_analysis(analysis_data,
chem_data = chem_data,
block_var = "TIME1",
treat_var = "GROUP_NAME")- 1
- Run Subpathway analysis
# 1. Save results
write.csv(path_dat,file = "../Data/Results/NonStratified_Full_subpathway_results.csv", row.names = F)- 1
- Save results
Stratified analysis
We may notice that we need to stratify our analysis if we believe the effects of the model are different between the levels of the stratified variable. For example, we may notice that sex is a dominant source of variation in the metabolite data, and we may want to look at males and females separately. One way to do this is to subset the data prior to running the subpathway_analysis function.
The first step is to identify the treatment variable, block variable and the variable that you would like to stratify by.
# Enter block variable name
blockvar = "TIME1"
# Enter treatment variable
treatvar = "GROUP_NAME"
# 1. Name of the variable to stratify by.
stratified_var = "Gender"
# 3. Find unique values of this variable
uni_vals <- unique(analysis_data[,stratified_var])- 1
- Identify the treatment and block variable
- 2
- Identifiy the variable to stratify by
- 3
- Find the unique values of the stratified variable
Now you need to enter the path you would like to store the statified data and the stratified analysis results.
# Enter Statified Variable path (Do not include / at the end)
statified_data_path = "../Data/Processed"
# Enter Statified Results path
stratified_results_path = "../Data/Results"- 1
- Path for statified data
- 2
- Path for stratified analysis results.
Now the following chunk runs a for-loop and saves the results for each strata.
Note: Metabolites with a large number of missing values may have unreliable results since the missing values were imputed with the minimum of the non-missing values of the metabolite. The warning message from the subpathway_analysis function will display the metabolite that produced the warning message.
# For each value perform the four steps listed above.
for(i in uni_vals){
print(paste0("Running the ", i, " strata."))
# 1. Subset analysis data
strat_data = analysis_data[analysis_data[,stratified_var]==i,]
# 2. Run subpathway function on the subsetted data
strat_path_results <- subpathway_analysis(strat_data,
chem_data = chem_data,
block_var = blockvar,
treat_var = treatvar)
# 3 Save stratified data
write.csv(strat_data, paste0(statified_data_path,"/analysis_data_",i,".csv"), row.names = F)
# 4 Save results
write.csv(strat_path_results, paste0(stratified_results_path,"/Stratified_",i,"_subpathway_results.csv"), row.names = F)
}[1] "Running the Female strata."
[1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
[1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
[1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924625"
[1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
[1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
[1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999925396"
[1] "Running the Male strata."
[1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
[1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
[1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 100001394"
[1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
[1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
[1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924548"
[1] "simpleWarning in anova.lm(int_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
[1] "simpleWarning in anova.lm(par_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
[1] "simpleWarning in anova.lm(treat_mod): ANOVA F-tests on an essentially perfect fit are unreliable\n for CHEM ID 999924748"
Number of significant subpathways by model type
Once we have the Overall Analysis results, we will have tested all metabolites with an interaction, parallel and Single model. We then obtain a p-values through the combined fisher analysis for each sub-pathway. In the next few chunks, we will summarize the results with a few different tables.
Before we create those table, please provide a path to the results you would like summarize. If you stratified the analysis you can provide a list of .csv files (one .csv file for each strata.). Additionlly, you will need to provide a list of the strata names.
# 1. Read in Results from the analysis step.
results_path <- c("../Data/Results/Stratified_Male_subpathway_results.csv",
"../Data/Results/Stratified_Female_subpathway_results.csv")
# 2. Define the names of the strata
strata <- c("Male", "Female")- 1
- Path to results
- 2
- Strata name for each path.
Additionally, please add a path to where you would like the results summary tables stored.
# Enter path to save tables
tablePath = "../Outputs/Tables"- 1
- Path to where the results are stored.
Now we will create the table that show the number of significant subpathways by model type for each strata.
Note: If you only tested the sub-pathways for the single model then only the single model is displayed.
Note: The model type for each sub-pathway is determined hiearchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
for (i in 1:length(strata)) {
# 1. Read in results data
path_data <- read.csv(results_path[i], check.names = F)
# 2. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
}
# 3. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 4. Create table
sig_subPaths <- count(table_data, model) %>%
arrange((model)) %>%
flextable() %>%
set_header_labels(model = "Model Type", n="Count")%>%
theme_vanilla() %>%
set_table_properties(layout = "autofit") %>%
set_caption(caption = paste0("Sigificant Pathways by Model (",
strata[i],")."))
# 5. Display table
cat(knitr::knit_print(sig_subPaths))
# 6. Save table
save_as_docx(sig_subPaths,path=paste0(
tablePath ,"/NumberOfSigPathwaysByModelType_",strata[i],".docx"))
} - 1
- Read in results data
- 2
- Structure levels
- 3
- Create table data
- 4
- Create table
- 5
- Display table
- 6
- Save table #### Proportion of significant subpathways within super pathways:
```{=html}
<div class="tabwid"><style>.cl-7e571160{table-layout:auto;}.cl-7e50465a{font-family:'Helvetica';font-size:11pt;font-weight:bold;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-7e50466e{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-7e53404e{margin:0;text-align:left;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-7e534058{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-7e53519c{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351a6{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351a7{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351b0{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351b1{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351b2{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351b3{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e5351ba{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-7e571160'><thead><tr style="overflow-wrap:break-word;"><th class="cl-7e53519c"><p class="cl-7e53404e"><span class="cl-7e50465a">Model Type</span></p></th><th class="cl-7e5351a6"><p class="cl-7e534058"><span class="cl-7e50465a">Count</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-7e5351a7"><p class="cl-7e53404e"><span class="cl-7e50466e">Interaction</span></p></td><td class="cl-7e5351b0"><p class="cl-7e534058"><span class="cl-7e50466e">84</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e5351b1"><p class="cl-7e53404e"><span class="cl-7e50466e">Parallel</span></p></td><td class="cl-7e5351b2"><p class="cl-7e534058"><span class="cl-7e50466e">12</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e5351b1"><p class="cl-7e53404e"><span class="cl-7e50466e">Single</span></p></td><td class="cl-7e5351b2"><p class="cl-7e534058"><span class="cl-7e50466e">2</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e5351b3"><p class="cl-7e53404e"><span class="cl-7e50466e">None</span></p></td><td class="cl-7e5351ba"><p class="cl-7e534058"><span class="cl-7e50466e">12</span></p></td></tr></tbody></table></div>
```
```{=html}
<div class="tabwid"><style>.cl-7e7a5878{table-layout:auto;}.cl-7e73b342{font-family:'Helvetica';font-size:11pt;font-weight:bold;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-7e73b34c{font-family:'Helvetica';font-size:11pt;font-weight:normal;font-style:normal;text-decoration:none;color:rgba(0, 0, 0, 1.00);background-color:transparent;}.cl-7e76819e{margin:0;text-align:left;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-7e7681a8{margin:0;text-align:right;border-bottom: 0 solid rgba(0, 0, 0, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);padding-bottom:5pt;padding-top:5pt;padding-left:5pt;padding-right:5pt;line-height: 1;background-color:transparent;}.cl-7e7692f6{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e769300{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 1.5pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e76930a{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e76930b{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0 solid rgba(0, 0, 0, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e76930c{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e76930d{background-color:transparent;vertical-align: middle;border-bottom: 0.75pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e769314{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}.cl-7e769315{background-color:transparent;vertical-align: middle;border-bottom: 1.5pt solid rgba(102, 102, 102, 1.00);border-top: 0.75pt solid rgba(102, 102, 102, 1.00);border-left: 0 solid rgba(0, 0, 0, 1.00);border-right: 0 solid rgba(0, 0, 0, 1.00);margin-bottom:0;margin-top:0;margin-left:0;margin-right:0;}</style><table data-quarto-disable-processing='true' class='cl-7e7a5878'><thead><tr style="overflow-wrap:break-word;"><th class="cl-7e7692f6"><p class="cl-7e76819e"><span class="cl-7e73b342">Model Type</span></p></th><th class="cl-7e769300"><p class="cl-7e7681a8"><span class="cl-7e73b342">Count</span></p></th></tr></thead><tbody><tr style="overflow-wrap:break-word;"><td class="cl-7e76930a"><p class="cl-7e76819e"><span class="cl-7e73b34c">Interaction</span></p></td><td class="cl-7e76930b"><p class="cl-7e7681a8"><span class="cl-7e73b34c">6</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e76930c"><p class="cl-7e76819e"><span class="cl-7e73b34c">Parallel</span></p></td><td class="cl-7e76930d"><p class="cl-7e7681a8"><span class="cl-7e73b34c">20</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e76930c"><p class="cl-7e76819e"><span class="cl-7e73b34c">Single</span></p></td><td class="cl-7e76930d"><p class="cl-7e7681a8"><span class="cl-7e73b34c">19</span></p></td></tr><tr style="overflow-wrap:break-word;"><td class="cl-7e769314"><p class="cl-7e76819e"><span class="cl-7e73b34c">None</span></p></td><td class="cl-7e769315"><p class="cl-7e7681a8"><span class="cl-7e73b34c">65</span></p></td></tr></tbody></table></div>
```
Additionally, we can summarize the results by looking at the number of significant sub-pathways within a super-pathway for each model type.
Note: The model type for each sub-pathway is determined hierarchically. For instance, if a sub-pathway shows a significant p-value for the interaction model and the parallel model, then the sub-pathway model type is only “interaction”.
for(i in 1:length(strata)){
# 1. Read results for the strata
path_data <- read.csv(results_path[i], check.names = F)
# 2. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 4. Formulate the Super-pathway results table.
superPath <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, pval) %>%
right_join(chem_data %>% select(SUPER_PATHWAY, SUB_PATHWAY) %>% distinct(),
by = c("sub_pathway" = "SUB_PATHWAY")) %>%
filter(!is.na(sub_pathway)) %>%
mutate(sig = ifelse(is.na(pval), 0, 1)) %>%
group_by(SUPER_PATHWAY) %>%
summarise(percent_significant = round(mean(sig) * 100, 2),
number_significant = sum(sig),
pathway_count = n()) %>%
ungroup() %>% arrange(-percent_significant) %>%
transmute(SUPER_PATHWAY, percent_significant = paste0(number_significant, " / ", pathway_count,
" (", percent_significant, "%)")) %>%
knitr::kable(format = 'pipe',col.names = c("Super Pathway", "Percent Significant"),
caption = paste0("Proportion of significant <br> subpathways within super-pathways (",
strata[i],").") )
# 5. Display table
print(superPath)
cat("\n")
# 6. Save table
save_kable(superPath, file = paste0(tablePath,"/SigSuperPathwayPecentages_",strata[i],".png"))
} - 1
- Read results for the strata
- 2
- Structure levels
- 3
- Create table data
- 4
- Formulate the Super-pathway results table.
- 5
- Display table
- 6
- Save table
Table: Proportion of significant <br> subpathways within super-pathways (Male).
|Super Pathway |Percent Significant |
|:---------------------------------|:-------------------|
|Amino Acid |16 / 16 (100%) |
|Energy |2 / 2 (100%) |
|Nucleotide |8 / 8 (100%) |
|Partially Characterized Molecules |1 / 1 (100%) |
|Xenobiotics |5 / 5 (100%) |
|Carbohydrate |7 / 8 (87.5%) |
|Lipid |46 / 53 (86.79%) |
|Cofactors and Vitamins |9 / 11 (81.82%) |
|Peptide |3 / 5 (60%) |
Table: Proportion of significant <br> subpathways within super-pathways (Female).
|Super Pathway |Percent Significant |
|:---------------------------------|:-------------------|
|Xenobiotics |5 / 5 (100%) |
|Amino Acid |13 / 16 (81.25%) |
|Peptide |3 / 5 (60%) |
|Nucleotide |3 / 8 (37.5%) |
|Lipid |16 / 53 (30.19%) |
|Carbohydrate |2 / 8 (25%) |
|Cofactors and Vitamins |2 / 11 (18.18%) |
|Energy |0 / 2 (0%) |
|Partially Characterized Molecules |0 / 1 (0%) |
List of subpathways that were found to have significant Fisher method p-values.
The next table we create will show all of the significant sub-pathways and their model type.
# Create for loop
for (i in 1:length(strata)){
# 1. Read Path data
path_data <- read.csv(results_path[i])
# 2. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Create pathway table
pathway_table <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct() %>%
filter(model != "None") %>%
mutate(pval = round(eval(cases),3)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval)
# 4. Format table
path_tab = pathway_table %>%
knitr::kable(format = 'pipe',col.names = c("Subpathway", "Model Type","P-value"),
caption = paste0("Significant Subpathways (",
strata[i],").") )
# 5. Display table
print(path_tab)
# 6. Save table
save_kable(path_tab ,file = paste0(tablePath,"/SigSubpathwayTable_",strata[i],".png"))
}- 1
- Read Path data
- 2
- Structure levels
- 3
- Create pathway table
- 4
- Format table
- 5
- Display table
- 6
- Save table
Table: Significant Subpathways (Male).
|Subpathway |Model Type | P-value|
|:------------------------------------------------------------|:-----------|-------:|
|Glutamate Metabolism |Interaction | 0.000|
|Polyamine Metabolism |Interaction | 0.000|
|Nicotinate and Nicotinamide Metabolism |Interaction | 0.000|
|Fatty Acid, Dihydroxy |Interaction | 0.000|
|Tryptophan Metabolism |Interaction | 0.000|
|Lysine Metabolism |Interaction | 0.000|
|TCA Cycle |Interaction | 0.000|
|Purine Metabolism, Guanine containing |Interaction | 0.000|
|Leucine, Isoleucine and Valine Metabolism |Interaction | 0.000|
|Glycolysis, Gluconeogenesis, and Pyruvate Metabolism |Interaction | 0.000|
|Primary Bile Acid Metabolism |Interaction | 0.000|
|Purine Metabolism, (Hypo)Xanthine/Inosine containing |Interaction | 0.000|
|Long Chain Polyunsaturated Fatty Acid (n3 and n6) |Interaction | 0.000|
|Medium Chain Fatty Acid |Interaction | 0.000|
|Methionine, Cysteine, SAM and Taurine Metabolism |Interaction | 0.000|
|Chemical |Interaction | 0.000|
|Purine Metabolism, Adenine containing |Interaction | 0.000|
|Pentose Phosphate Pathway |Interaction | 0.000|
|Urea cycle; Arginine and Proline Metabolism |Interaction | 0.000|
|Alanine and Aspartate Metabolism |Interaction | 0.000|
|Sterol |Interaction | 0.000|
|Phospholipid Metabolism |Interaction | 0.000|
|Creatine Metabolism |Interaction | 0.000|
|Secondary Bile Acid Metabolism |Interaction | 0.000|
|Sphingolipid Synthesis |Interaction | 0.000|
|Gamma-glutamyl Amino Acid |Interaction | 0.000|
|Food Component/Plant |Interaction | 0.000|
|Fatty Acid, Dicarboxylate |Interaction | 0.000|
|Long Chain Saturated Fatty Acid |Interaction | 0.000|
|Pyrimidine Metabolism, Orotate containing |Interaction | 0.000|
|Long Chain Monounsaturated Fatty Acid |Interaction | 0.000|
|Vitamin A Metabolism |Interaction | 0.000|
|Pyrimidine Metabolism, Uracil containing |Interaction | 0.000|
|Pentose Metabolism |Interaction | 0.000|
|Pyrimidine Metabolism, Cytidine containing |Interaction | 0.000|
|Tocopherol Metabolism |Interaction | 0.000|
|Endocannabinoid |Interaction | 0.000|
|Aminosugar Metabolism |Interaction | 0.000|
|Fatty Acid, Monohydroxy |Interaction | 0.000|
|Glycerolipid Metabolism |Interaction | 0.000|
|Ceramides |Interaction | 0.000|
|Phosphatidylethanolamine (PE) |Interaction | 0.000|
|Phosphatidylinositol (PI) |Interaction | 0.000|
|Phosphatidylcholine (PC) |Interaction | 0.000|
|Sphingomyelins |Interaction | 0.000|
|Dipeptide |Interaction | 0.000|
|Monoacylglycerol |Interaction | 0.000|
|Lysoplasmalogen |Interaction | 0.000|
|Lysophospholipid |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Long Chain Saturated) |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Medium Chain) |Interaction | 0.000|
|Ascorbate and Aldarate Metabolism |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Glycine) |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Monounsaturated) |Interaction | 0.000|
|Partially Characterized Molecules |Interaction | 0.000|
|Hexosylceramides (HCER) |Interaction | 0.000|
|Fatty Acid, Branched |Interaction | 0.000|
|Diacylglycerol |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Polyunsaturated) |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Hydroxy) |Interaction | 0.000|
|Fatty Acid, Amino |Interaction | 0.000|
|Fatty Acid Metabolism (Acyl Carnitine, Dicarboxylate) |Interaction | 0.000|
|Plasmalogen |Interaction | 0.000|
|Dihydrosphingomyelins |Interaction | 0.000|
|Lactoyl Amino Acid |Interaction | 0.000|
|Unknown Metabolite |Interaction | 0.000|
|Glycosyl PE |Interaction | 0.002|
|Tyrosine Metabolism |Interaction | 0.010|
|Histidine Metabolism |Interaction | 0.010|
|Hemoglobin and Porphyrin Metabolism |Interaction | 0.010|
|Sphingosines |Interaction | 0.010|
|Glycine, Serine and Threonine Metabolism |Interaction | 0.010|
|Glutathione Metabolism |Interaction | 0.010|
|Vitamin B6 Metabolism |Interaction | 0.010|
|Thiamine Metabolism |Interaction | 0.010|
|Fructose, Mannose and Galactose Metabolism |Interaction | 0.010|
|Phenylalanine Metabolism |Interaction | 0.020|
|Fatty Acid Metabolism (also BCAA Metabolism) |Interaction | 0.020|
|Disaccharides and Oligosaccharides |Interaction | 0.033|
|Ketone Bodies |Interaction | 0.034|
|Mevalonate Metabolism |Interaction | 0.040|
|Riboflavin Metabolism |Interaction | 0.040|
|Tetrahydrobiopterin Metabolism |Interaction | 0.040|
|Guanidino and Acetamido Metabolism |Interaction | 0.040|
|Drug - Topical Agents |Parallel | 0.000|
|Pyrimidine Metabolism, Thymine containing |Parallel | 0.000|
|Dihydroceramides |Parallel | 0.000|
|Eicosanoid |Parallel | 0.000|
|Benzoate Metabolism |Parallel | 0.000|
|Phosphatidylserine (PS) |Parallel | 0.000|
|Dipeptide Derivative |Parallel | 0.000|
|Docosanoid |Parallel | 0.001|
|Glycogen Metabolism |Parallel | 0.002|
|Purine and Pyrimidine Metabolism |Parallel | 0.008|
|Inositol Metabolism |Parallel | 0.010|
|Drug - Other |Parallel | 0.040|
|Phosphatidylglycerol (PG) |Single | 0.001|
|Oxidative Phosphorylation |Single | 0.026|
Table: Significant Subpathways (Female).
|Subpathway |Model Type | P-value|
|:----------------------------------------------------|:-----------|-------:|
|Purine Metabolism, (Hypo)Xanthine/Inosine containing |Interaction | 0.000|
|Pentose Phosphate Pathway |Interaction | 0.004|
|Glycogen Metabolism |Interaction | 0.011|
|Food Component/Plant |Interaction | 0.020|
|Drug - Other |Interaction | 0.040|
|Lactoyl Amino Acid |Interaction | 0.040|
|Nicotinate and Nicotinamide Metabolism |Parallel | 0.000|
|Primary Bile Acid Metabolism |Parallel | 0.000|
|Chemical |Parallel | 0.000|
|Histidine Metabolism |Parallel | 0.000|
|Hemoglobin and Porphyrin Metabolism |Parallel | 0.000|
|Secondary Bile Acid Metabolism |Parallel | 0.000|
|Sphingolipid Synthesis |Parallel | 0.000|
|Eicosanoid |Parallel | 0.000|
|Benzoate Metabolism |Parallel | 0.000|
|Acetylated Peptides |Parallel | 0.000|
|Phosphatidylserine (PS) |Parallel | 0.000|
|Hexosylceramides (HCER) |Parallel | 0.000|
|Diacylglycerol |Parallel | 0.000|
|Unknown Metabolite |Parallel | 0.000|
|Drug - Topical Agents |Parallel | 0.010|
|Polyamine Metabolism |Parallel | 0.020|
|Phospholipid Metabolism |Parallel | 0.030|
|Glutamate Metabolism |Parallel | 0.040|
|Urea cycle; Arginine and Proline Metabolism |Parallel | 0.040|
|Glycine, Serine and Threonine Metabolism |Parallel | 0.040|
|Tryptophan Metabolism |Single | 0.000|
|Lysine Metabolism |Single | 0.000|
|Leucine, Isoleucine and Valine Metabolism |Single | 0.000|
|Phenylalanine Metabolism |Single | 0.000|
|Methionine, Cysteine, SAM and Taurine Metabolism |Single | 0.000|
|Purine Metabolism, Adenine containing |Single | 0.000|
|Tyrosine Metabolism |Single | 0.000|
|Fatty Acid, Dicarboxylate |Single | 0.000|
|Fatty Acid Metabolism (also BCAA Metabolism) |Single | 0.000|
|Dipeptide |Single | 0.000|
|Lysoplasmalogen |Single | 0.000|
|Plasmalogen |Single | 0.000|
|Inositol Metabolism |Single | 0.010|
|Pyrimidine Metabolism, Uracil containing |Single | 0.010|
|Docosanoid |Single | 0.015|
|Guanidino and Acetamido Metabolism |Single | 0.020|
|Lysophospholipid |Single | 0.020|
|Dipeptide Derivative |Single | 0.030|
|Fatty Acid Metabolism (Acyl Glycine) |Single | 0.040|
Metabolites within significant subpathways
Now we will looks at the metabolites within significant sub-pathways, and the p-values for the significant model. In this section we will look at all of the metabolites within the top two sub-pathways based on p-values. To do this we will first need to identify which model we would like to look at.
Please identify which model type you would like to focus on.
# Enter Model type (Interaction, Parallel, Single)
mod = "Interaction"- 1
- Enter Model type (Interaction, Parallel, Single)
# Create for loop
for(i in 1:length(strata)){
# 1. Read in table data and results from overall analysis.
path_data <- read.csv(results_path[i], check.names = F)
# 2. Structure levels
if(sum(grepl("interaction",names(path_data)))==0){
levs <- c("Single","None")
cases <- expression( case_when(model == "Single" ~ single_fisher))
}
if(!sum(grepl("interaction",names(path_data)))==0){
levs = c("Interaction", "Parallel", "Single", "None")
cases <- expression( case_when(model == "Interaction" ~ interaction_fisher,
model == "Parallel" ~ parallel_fisher,
model == "Single" ~ single_fisher))
}
# 3. Create table data
table_data <- path_data %>%
mutate(model = factor(model, levels = levs)) %>%
select(sub_pathway, ends_with("_fisher"), model) %>%
distinct()
# 4. Create top pathways table
top_pathway_table <- table_data %>%
filter(model != "None") %>%
mutate(pval = eval(cases)) %>%
select(sub_pathway, model, pval) %>%
arrange(model, pval) %>%
filter(model == mod) %>%
group_by(model) %>% slice(1:2) %>% ungroup() %>%
left_join(select(path_data, sub_pathway, chem_name, ends_with("_pval")),
by = "sub_pathway") %>%
distinct()
# 5. list the top two subpathways.
pathways <- unique(top_pathway_table$sub_pathway)
# Create pathways for look
for(j in 1:length(pathways)){
# 6. Create table for specific model type
top_paths_table <- top_pathway_table %>%
filter(sub_pathway == pathways[j]) %>%
select(chem_name, all_of(paste0(tolower(mod),"_pval"))) %>%
arrange(all_of(paste0(tolower(mod),"p_val"))) %>%
knitr::kable(format = "pipe",col.names = c("Chemical Name","P-value"), caption = paste0("Metabolites Within Pathways of Significant ",mod, " Model(",
strata[i],")."))
# 7. Display table
print(top_paths_table)
# 8. Save table
save_kable(top_pathway_table,
file = paste0(tablePath,"/Metabolite_model_pvalues_",gsub("[^A-Za-z0-9 ]","",pathways[j]),"_",strata[i],".png"))
}
} - 1
- Read in table data and results from overall analysis.
- 2
- Structure levels
- 3
- Create table data
- 4
- Create top pathways table
- 5
- list the top two subpathways.
- 6
- Create table for specific model type
- 7
- Display table
- 8
- Save table
Table: Metabolites Within Pathways of Significant Interaction Model(Male).
|Chemical Name | P-value|
|:----------------------------------|---------:|
|S-1-pyrroline-5-carboxylate | 0.3561361|
|glutamate | 0.2329614|
|glutamine | 0.0226899|
|N-acetylglutamate | 0.0910388|
|N-acetylglutamine | 0.0907479|
|N-acetyl-aspartyl-glutamate (NAAG) | 0.1034220|
|alpha-ketoglutaramate* | 0.0747458|
|4-hydroxyglutamate | 0.2484560|
|N-methyl-GABA | 0.0762810|
|carboxyethyl-GABA | 0.0606772|
Table: Metabolites Within Pathways of Significant Interaction Model(Male).
|Chemical Name | P-value|
|:------------------------------|---------:|
|putrescine | 0.1814704|
|spermidine | 0.0605244|
|N-acetylputrescine | 0.0417295|
|5-methylthioadenosine (MTA) | 0.1311389|
|spermine | 0.0912480|
|4-acetamidobutanoate | 0.2824201|
|(N(1) + N(8))-acetylspermidine | 0.0344458|
Table: Metabolites Within Pathways of Significant Interaction Model(Female).
|Chemical Name | P-value|
|:------------------------------|---------:|
|hypoxanthine | 0.0063033|
|allantoic acid | 0.2504492|
|inosine | 0.0013076|
|inosine 5'-monophosphate (IMP) | 0.8593660|
|allantoin | 0.3160026|
|xanthine | 0.0207493|
|urate | 0.6665537|
|xanthosine | 0.9614498|
|N1-methylinosine | 0.9407514|
Table: Metabolites Within Pathways of Significant Interaction Model(Female).
|Chemical Name | P-value|
|:------------------|---------:|
|ribose 1-phosphate | 0.0036726|
Pairwise Compairisons
At the metabolite level, we can look at the pairwise comparisons for all experimental groups. In the following section we will look at the pairwise comparisons using the metabolite_pairwise function defined in the R scripts. We will first read in the analysis data needed for this analysis. Since the column names of the analysis data are numeric, we should convert all of the column names to to characters.
path3 = "../Data/Processed/analysis_data.csv"- 1
- Please provide a path to the analysis data
Additionally, please provide a path to the folder where you would like to store the pairwise results.
pair_strat_results_path = "../Data/Results/Pairwise_Comparisons"- 1
- Please provide a path to where you would like to store the results.
Once we have the analysis data loaded in, we can now look at the pairwise comparisons for each of our experimental groups. To do this we will be using the metabolite_pairwise function defined in the R scripts. This function will analyze each metabolite individually. For each metabolite, the the metabolite_pairwise function will first test if any of the experimental groups explained a significant proportion of the variance in the metabolite using an F-test. Since we will be looking at multiple comparisons for the metabolite it is good practice to first look at the over-all p-value from the F-test before looking at the pairwise comparisons.
The metabolite_pairwise function then looks at all pairwise comparisons utilizing the emmeans package. The metabolite_pairwise function returns a data frame with the metabolite, overall p-value, estimated difference in means for all experimental group combinations, and the p-value which test if the difference the groups is significantly different.
The metabolite_pairwise function has three arguments:
form: This is a character string the resembles the right hand side of a simple linear regression model in R. For example form = “Group1 + Group2”.
data: The analysis data we will use for the pairwise comparisons.
Metabolites: A list of metabolites that we want to be analyzed.
In this chunk please ender the variable you would like to stratify by, the form of the model, and the non-metabolite variables in the analysis dataset.
# 1. Define the variable in the analysis data that we want to stratify
strata_var <- "Gender"
# 2. Define the form variable
form <- "GROUP_NAME*TIME1"
# Enter Non-metabolite variables
non_metabolite <- c("PARENT_SAMPLE_NAME", "GROUP_NAME",
"TIME1", "Gender")
metabolites <- names(analysis_data)[!names(analysis_data) %in% non_metabolite]- 1
- Define variable to stratify by
- 2
- Define the functional form of the model
- 3
- enter the non-metabolite variables in the analysis dataset
- 4
- This code automatically gets the metabolite names.
Once we have the desired information, we can run the stratified pairwise analysis.
# 1. Read in data from the analysis above
analysis_data <- read.csv(path3, check.names = F)
# 2. Make the variable names charactors
names(analysis_data) <- as.character(names(analysis_data))
stratas <- unique(analysis_data[,strata_var])
for (i in 1:length(stratas)) {
# 3. Filter analysis data
dat <- analysis_data[analysis_data[,strata_var]==stratas[i],]
# 4. Run the metabolite_pairwise function for the strata
strat_results <- metabolite_pairwise(form = form,
data=dat,
metabolites = metabolites)
# 5. Save results
write.csv(strat_results, file = paste0(pair_strat_results_path,"/Pairwise_",
stratas[i],".csv"), row.names = F)
}- 1
- Read in data from the analysis above
- 2
- Make the variable names characters
- 3
- Filter analysis data
- 4
- Run the metabolite_pairwise function for the strata
- 5
- Save results
================================================================================
================================================================================
Log Fold-Change Heatmap
For the metabolites which had a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis), we will produce a heatmap of the log fold changes. This heatmap colors will only show if the log fold-change is greater than log(2) or less than log(.5). Therefore, this heatmap will only focus on comparisons with a fold change of two or greater.
Before creating the heatmaps, please provide a path to the stratified result file you would like to use. Additionally, please provide a path to the Metabolone .xlsx file so we can utilize the chemical annotation file.
# Enter pairwise results path
pair_results_path = "../Data/Results/Pairwise_Comparisons/Pairwise_Female.csv"
# Enter path to metabolome .xlsx file
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"- 1
- Path to stratified result
- 2
- Path to Metabolome .xlsx file
Now we will generate an interactive estimate heatmap.
# 1. Read in pairwise results
results_pair <-read.csv(pair_results_path,
check.names = F) %>%
filter(Overall_pval < 0.05)
# Read chemical annotation file
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
data <- chem_data %>%
select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
arrange(SUB_PATHWAY)
# 3. Produce Heatmap
p = data %>%
select(CHEM_ID,SUB_PATHWAY,CHEMICAL_NAME, all_of(names(data)[grepl("EST",names(data))])) %>%
melt(id.vars = c("CHEM_ID","SUB_PATHWAY","CHEMICAL_NAME"), variable.name = "Contrast",
value.name = "logFoldChange") %>%
mutate(Contrast = gsub("_ESTS","",Contrast),
logFoldChange = ifelse(logFoldChange < log(0.5) | logFoldChange > log(2),
round(logFoldChange,3), NA)) %>%
plot_ly(
type = "heatmap",
x= ~Contrast,
y = ~CHEMICAL_NAME,
z = ~logFoldChange,
text = ~SUB_PATHWAY,
hovertemplate = paste("<b>Metabolite: %{y}</b><br><br>",
"Sub-pathway: %{text}<br>",
"Contrast: %{x}<br>",
"Log Fold-Change: %{z}<br>",
"<extra></extra>"),
colorbar = list(title ="<b>Log Fold-Change</b>")) %>%
layout(title = "<b>Log Fold-Change Heatmap</b>",
xaxis = list(title="<b>Contrasts</b>"),
yaxis = list(title = ""))
# 4. Display heatmap
p- 1
- Read in pairwise results
- 2
- Merge the chemical annotation fill with the results from the pairwise comparisons.
- 3
- Produce Heatmap
- 4
- Display heatmap
P-Value Heatmap
For the metabolites which had a significant overall p-value (which tested if the treatment group means were equal under the null hypothesis), we will now produce a heatmap based on the pairwise comparison p-values.
# 1. Read in pairwise results
results_pair <-read.csv(pair_results_path,
check.names = F) %>%
filter(Overall_pval < 0.05)
# Read chemical annotation file
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 2. Merge the chemical annotation fill with the results from the pairwise comparisons.
data <- chem_data %>%
select(SUB_PATHWAY,CHEMICAL_NAME,CHEM_ID) %>%
merge(results_pair, by.x = "CHEM_ID",by.y = "Metabolite") %>%
arrange(SUB_PATHWAY)
# 3. Produce Heatmap
p = data %>%
select(CHEM_ID,SUB_PATHWAY,CHEMICAL_NAME, all_of(names(data)[grepl("PVALS",names(data))])) %>%
melt(id.vars = c("CHEM_ID","SUB_PATHWAY","CHEMICAL_NAME"), variable.name = "Contrast",
value.name = "P_value") %>%
mutate(Contrast = gsub("_PVALS","",Contrast),
P_value = round(P_value,3)) %>%
arrange(SUB_PATHWAY) %>%
plot_ly(
type = "heatmap",
x= ~Contrast,
y = ~CHEMICAL_NAME,
z = ~P_value,
text = ~SUB_PATHWAY,
hovertemplate = paste("<b>Metabolite: %{y}</b><br><br>",
"Sub-pathway: %{text}<br>",
"Contrast: %{x}<br>",
"P-Value: %{z}<br>",
"<extra></extra>"),
colorbar = list(title ="<b>P-value</b>")) %>%
layout(title = "<b>P-Value Heatmap</b>",
xaxis = list(title="<b>Contrasts</b>"),
yaxis = list(title = ""))
# 4. Display heatmap
p- 1
- Read in the results from the pairwise comparisons. If you stratified the analysis, you may want to copy this chunk and rerun for each strata. Additionally, we will need to read in the chemical annotation file. When we read in the results from the pairwise comparisons, we will filter the results to only look at the metabolites with a significant overall p-value.
- 2
- Merge the chemical annotation fill with the results from the pairwise comparisons.
- 3
- Create the heatmap. In this step we are doing some pre-precessing which removes the log fold change columns and then transforms the remaining p-value results from wide to long format. Then we use plotly to create an interactive heatmap.
Box Plots and Line Plots
Visualizations of the data can help us see the underlying trends. Two useful visualization is boxplots and line graphs. Metabolon provides customers with similar box plots, but the following will allow to look at the boxplots specifically with a sub-pathway.
Boxplots
Suppose there is a specific subpathway that we would like to visualize. In that case, we can compare the treatment groups for each metabolite within that sub-pathway using the subpathway_box plots function defined in the R script. This function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the sub pathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the sub-pathway of interest.
X: This the the name of the variable in the meta data that is used for the X axis of the box plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
… : At the end of this function you can provide additional arguments for filtering the analysis data. In the examples below we are filtering so the box plots only focus on males and only two treatment groups.
The output of the subpathway_boxplot function is box plots for each of the metabolites within the specified subpathway. The labels for the box plots are default based on the data created inside the subpathway_boxplots function. Since these box plots utilize ggplot, we can customize all labels using the. labs function from ggplot.
Box plots example
Here we look at the boxplots for each metabolite within a specified subpathway utilizing the subpathway_boxplots function.
First prove a path to the analysis data and the Metabolon .xlsx file so we can utilize this chemical annotation file.
# Enter pairwise results path
analysis = "../Data/Processed/analysis_data.csv"
# Enter chemical annotation path
met_excel <- "../Data/Metabolon/UNAZ-0501-22VW_ DATA TABLES.xlsx"- 1
- Provide path to the analysis data file
- 2
- Provide path to the Metabolon .xlsx file
Now you need to specify a subpathway of interest.
# 1. Specify subpathway of interest.
subpath = "Gamma-glutamyl Amino Acid"- 1
- Specify subpathway of interest.
Finally, specifiy where you would like to save the graph.
plot_save = "../Outputs"- 1
- Provide path to outputs.
The following chunk will generate box plots accross the expiremental conditions. You will need to edit the arguments in the subpathway_boxplots function to change the ouputed boxplots
# 1. Read in analysis data
analysis_data <- read.csv(analysis, check.names = F)
# 2. Reach in chemical annotation data
# Read chemical annotation file
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 3. Run the subpathway_boxplots function
box_1 <- subpathway_boxplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpath,
X = TIME1,
groupBy = GROUP_NAME,
# Additional analysis data filtering options.
Gender == "Male", GROUP_NAME != "Control")
# 4. Customize labels for the boxplots
box_1 +
labs(x = "Time", y="Scaled Intensity", color="Treatment Group",
title = "Metabolite boxplots for the Gamma-glutamyl Amino Acid Subpathway")- 5
- Save boxplots. You can change the file type for example change pdf to png.
# 5. Save boxplots. You can change the file type for example change pdf to png.
ggsave(filename = paste0(plot_save,"/Plots/Metabolite_boxplots_",subpath,".pdf"))Saving 10 x 10 in image
Line plots
Another useful visualization is line plots. These plots are useful for seeing trends between an ordered variable and the scaled peak intensity. Line plots can be useful for analyzing trends across the different treatment groups. Like the box plots, we are interested in seeing trends for all metabolites within a sub-pathway. To do this, we can utilize the “subpathway_lineplots” function defined in the R scripts. This function takes the following arguments.
Analysis data (analysis_data): This is the analysis data used for the modeling in the primary analysis. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Chemical annotation data (chem_data): The chemical annotation data. This argument is necessary to link the columns of the analysis data to the subpathways. If you followed this pipeline all the way through, the analysis data is saved in the Data/Processed folder.
Pathway (pathway): This is the name of the subpathway of interest.
X: This the name of the variable in the metadata that is used for the X axis of the line plots.
Group By (groupBy): This is a grouping variable. As a recommendation the treatment groups should be used in the groupBy argument as this will provide a different color for each of the treatments making it easier to identify.
Line plot example
Below is a demonstration of utilizing the line plots to see trends between treatment groups at the metabolite level.
# 1. Read in analysis data
analysis_data <- read.csv(analysis, check.names = F)
# 2. Reach in chemical annotation data
# Read chemical annotation file
chem_data <- read.xlsx(met_excel,sheet = "Chemical Annotation")
# 3. Prep data for line plots
analysis_data <- analysis_data %>%
filter(Gender=="Male") %>%
mutate(TIME1 = case_when(TIME1== "PreSymp" ~ 1,
TIME1 =="Onset" ~ 2,
TIME1 == "End" ~ 3))- 1
- Read in the analysis data
- 2
- Read in the chemical annotation data
- 3
- Prep data for line plots. To produce the lines we must provide a numeric variable for the x-axis. If you are looking at trends across an ordinal categorical variable you will need to convert the variable to be numeric. For example, if we want to see the trend in time between treatment groups, where time takes the ordinal values of “Pre-symptomatic”, “onset of symptoms”, and “end of symptoms”, then we will need to assign these names to the values of 1,2, and 3 respectively. Additionally, we will need to make any other desired adjustments to our data at this step.
# 4. Run subpathway_line plots function.
line_plots <- subpathway_lineplots(analysis_data = analysis_data,
chem_data = chem_data,
subpathway = subpath,
X = TIME1,
groupBy = GROUP_NAME)
# 5. Edit asthetics of plots
line_plots +
scale_x_continuous(breaks = c(1, 2, 3), labels = c("PreSymp", "Onset", "End")) +
labs(x= "Time",y="Scaled Intensity",color = "Treatment Group")- 6
- Save the plots in the folder “Outputs/Plots” under the file name “linePlots_{{subpathway}} where {{subpathway}} is the name of the subpathway defined in step 4.
`geom_smooth()` using formula = 'y ~ x'
# 6. Save plots
ggsave(filename = paste0(plot_save ,"/LinePlots_",subpath,".pdf"))Saving 12 x 12 in image
`geom_smooth()` using formula = 'y ~ x'